Abstract

In this paper we address issues of reliability of RAID sys- 
tems. We focus on "big data" systems with a large number 
of drives and advanced error correction schemes beyond 
RAID 6. Our RAID paradigm is based on Reed-Solomon 
codes, and thus we assume that the RAID consists of N 
data drives and M check drives. The RAID fails only if 
the combined number of failed drives and sector errors 
exceeds M, a property of Reed-Solomon codes. 

We review a number of models considered in the litera- 
ture and build upon them to construct models usable for a 
large number of data and check drives. We attempt to ac- 
count for a significant number of factors that affect RAID 
reliability, such as drive replacement or lack thereof, mis- 
takes during service such as replacing the wrong drive, de- 
layed repair, and the finite duration of RAID reconstruc- 
tion. We evaluate the impact of sector failures that do not 
result in drive replacement. 

The reader who needs to consider large M and N will 
find applicable mathematical techniques concisely sum- 
marized here, and should be able to apply them to simi- 
lar problems. Most methods are based on the theory of 
continuous time Markov chains, but we move beyond this 
framework when we consider the fixed time to rebuild 
broken hard drives, which we model using systems of de- 
lay and partial differential equations. 

One universal statement is applicable across various 
models: increasing the number of check drives in all cases 
increases the reliability of the system, and is vastly su- 
perior to other approaches of ensuring reliability such as 
mirroring. 

1 Introduction 

RAID technology (see [7]) has a single primary focus: to 
apply mathematical techniques to organize data on multi- 
ple data storage devices such that in the event of one or 
more device failures, the original data stored is still avail- 
able. In recent years, this feature has become much more 
important. The reason for this is the rapid erosion of the 
relative reliability of data storage devices. 

For example, according to the manufacturer, a typical
high capacity disk drive will experience an unrecoverable
read error once in every $10^{14}$ bits, as discussed in [4]. The
same drive can transfer data at a rate of $6 \times 10^{8}$ bits per second. At
this transfer rate, this data storage device will lose data
every $1.67 \times 10^{5}$ seconds, or roughly every 2 days.

To build high capacity storage systems, a hundred or 
more disk drives may be deployed, combined in a single 
system. Using the manufacturer's projections, this system 
would lose data roughly every 30 minutes. Hence, the need
for reliable mathematical techniques to recover from
this data loss is becoming increasingly urgent for systems
that store "big data."
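The arithmetic behind these estimates is easy to reproduce. The following minimal sketch assumes continuous reads at the full transfer rate and a system of exactly 100 drives (an illustrative reading of "a hundred or more"); the constant and function names are hypothetical.

```python
# Back-of-the-envelope check of the error-rate arithmetic in the text.
BITS_PER_UNRECOVERABLE_ERROR = 1e14   # one unrecoverable read error per 10^14 bits
TRANSFER_RATE_BITS_PER_SEC = 6e8      # sustained transfer rate of one drive
NUM_DRIVES = 100                      # assumption: "a hundred or more" taken as 100


def seconds_between_errors(num_drives: int = 1) -> float:
    """Mean time between unrecoverable read errors when `num_drives`
    drives all read continuously at the full transfer rate."""
    return BITS_PER_UNRECOVERABLE_ERROR / (TRANSFER_RATE_BITS_PER_SEC * num_drives)


single = seconds_between_errors(1)
system = seconds_between_errors(NUM_DRIVES)
print(f"single drive: {single:.3g} s = {single / 86400:.2f} days")
print(f"{NUM_DRIVES} drives: {system:.3g} s = {system / 60:.1f} minutes")
```

Under these assumptions the single-drive interval is just under two days and the 100-drive interval is just under half an hour, in line with the rounded figures quoted above.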

There exists another important class of failure that is 
critical to overcome in order to ensure that the original 
data is still available: silent data corruption. This occurs 
when a storage device delivers incorrect data, and reports 
it as correct. These events are well known and have been 
repeatedly measured and documented in the industry. Not 
a single storage device manufacturer supplies a specifi- 
cation as to how often these events will occur. A well 
designed RAID system, to be truly resilient in the face of 
all these failure scenarios, must be able to detect and cor- 
rect as well as recover from a wide class of reported and 
unreported error scenarios. 

In this paper, we attempt to quantify the increased re- 
liability that is achieved by constructing RAID systems 
with more robust error correcting codes (ECC). Standard 
RAID ECCs are often termed RAID 6, and are resilient in 
the face of two drives failing. However, for large systems, 
these codes are simply not robust enough, especially in 
the light of "expected" failures that occur every few min- 
utes. With only two drive protection, service events must 
be scheduled quickly, or the system runs a serious risk of 
permanent data loss. Employing additional check drives 
allows for fewer service events that can be scheduled with 
more flexibility. 

Well known Reed Solomon ECC codes can extend the 
reliability of RAID systems to tolerate many more re- 
ported failures and succeed in delivering correct data even 
in the face of silent data corruption. Since there is no ac- 
curate way to project the underlying reliability or correct- 
ness of the individual data storage devices, we propose 
that employing a more resilient mathematical technique is 
imperative to the design of future RAID systems intended
to store "big data."

1.1 Measures of RAID Reliability 

The most common metric used to measure the reliabil- 
ity of a RAID system is the Mean Time to Data Loss 
(MTTDL), which measures the average time it takes for a 
given RAID system to experience a failure in which data 
is irrecoverably lost. However, MTTDL can be difficult to
interpret and somewhat misleading, as discussed in [10], [?].

Here we will focus on the Probability of Data Loss
within a specified deployment time $t$ ($\mathrm{PDL}_t$). This measure
is more useful to a user of a RAID system than
MTTDL in that it allows the user to think in terms of the
acceptable risk of losing data during the expected lifetime
of the data storage system, and is a more nuanced measure
than MTTDL. However, to provide easy comparison with
results from other authors, we will discuss the MTTDL
for our models, as well as the $\mathrm{PDL}_t$. Some authors have
chosen to focus on how much data a RAID system expects 
to lose in a given deployment time. We argue that any data 
loss is unacceptable and thus focus solely on whether or 
not data is lost, not how much. 



2 Model 1: No Repair 

Let us start by exploring a particularly simple model of 
the reliability of a RAID system. Although this model is 
not nuanced, it will allow us to develop the relevant math- 
ematical techniques in a case where analytical solutions 
are both tenable and concise. In future sections, we will 
build off this model to create more complex and realistic 
models of RAID reliability. Such models are examples of
birth and death chains, which have been well studied in
the mathematical literature; see [?] and [14], for example.

Consider a RAID system consisting of N data drives 
plus M check drives. There are T = N + M drives in the 
system, with a storage rate N/T. RAID storage systems 
are such that the system can tolerate and recover from up 
to M drive failures; if there are M + 1 or more failures, 
data is irretrievably lost. See [8] and [?] for the mathematics
of such systems.

In this first "no repair" model, we assume that drives 
are never repaired or replaced; if a drive fails, the system 
continues to operate without it. So long as no more than 
M drives fail, the system is functional and all data can be 
read and new data can be written. Maintenance can be 
expensive, so engineering a system that will never need 
to be touched by human hands within some fixed deploy- 
ment time might be a good design solution. Thus although 
this model is simple, it is also realistic. 

We model this system as a discrete state, continuous
time Markov process with $M + 2$ states, as shown in Figure 1.
State $i$ indicates that $i$ drives have failed. The system
is initialized in state 0 with all drives working. When
a drive fails the system moves from state $i$ to state $i + 1$.
If drives fail independently at a constant rate of failure $\lambda$
per drive, then the system moves from state $i$ to $i + 1$ with
an effective failure rate $\lambda_i = (T - i)\lambda$. If the system enters
state $M + 1$, the failure state, then the RAID system has
failed, and data has been lost.

Figure 1: Markov chain for the no repair model of RAID reliability.

The probability distribution on the set of states is a
probability vector

$$\mathbf{q}(t) = (q_0(t), q_1(t), \ldots, q_{M+1}(t))^T$$

where $q_j(t)$ is the probability that the system is in state $j$ at
time $t$. Thus, $\sum_{j=0}^{M+1} q_j(t) = 1$. The evolution of the probability
distribution $\mathbf{q}(t)$ is governed by the system of ordinary
differential equations (the Kolmogorov-Chapman equations):

$$\frac{d}{dt}\mathbf{q} = A(t)\,\mathbf{q}, \qquad \mathbf{q}(0) = (1, 0, \ldots, 0)^T \tag{1}$$



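where $A(t)$ is the transition matrix. For the no repair
model,

$$A = \begin{pmatrix} -\lambda_0 & & & & \\ \lambda_0 & -\lambda_1 & & & \\ & \lambda_1 & \ddots & & \\ & & \ddots & -\lambda_M & \\ & & & \lambda_M & 0 \end{pmatrix} \tag{2}$$

and $\lambda_i = (T - i)\lambda$. Notice that the eigenvalues of $A$ are
0 and $-\lambda_i$ for $0 \le i \le M$.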

If $X(t)$ is the state of the system at time $t$ (so $X(t) \in \{0, 1, 2, \ldots, M + 1\}$) then the transition probability satisfies, for $j \ne k$, the equation:

$$P(X(t+h) = j \mid X(t) = k) = a_{jk}(t)\,h + o(h)$$

If the Markov process is stationary, the matrix $A(t)$ is constant; we will restrict our models to this case. From the probabilistic nature of the matrix it follows that for all $t \ge 0$, $\sum_j a_{jk}(t) = 0$. The differential equation (1) has solution

$$\mathbf{q}(t) = \exp(tA)\,\mathbf{q}(0) \tag{3}$$

where the matrix $\exp(tA)$ is the matrix exponential, i.e. the sum of the series $\sum_{k=0}^{\infty} (t^k/k!)\,A^k$ (see [12], p. 206).
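As an illustration of (3), the following sketch builds the no repair transition matrix described above and evaluates $\exp(tA)\,\mathbf{q}(0)$ with a simple scaling-and-squaring Taylor series. The numerical method, the function names, and the parameter values ($T = 10$, $M = 2$, $\lambda = 0.1$ per year, $t = 5$ years) are assumptions chosen for illustration; a production calculation would normally use a library routine for the matrix exponential.

```python
import math


def no_repair_generator(T, M, lam):
    """(M+2)x(M+2) transition matrix with A[i][i] = -lam_i and A[i+1][i] = lam_i,
    where lam_i = (T - i) * lam. The last column is zero (absorbing failure state)."""
    n = M + 2
    A = [[0.0] * n for _ in range(n)]
    for i in range(M + 1):
        rate = (T - i) * lam
        A[i][i] = -rate
        A[i + 1][i] = rate
    return A


def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(A, t, terms=30):
    """exp(tA): Taylor series of exp(tA / 2^s), then squared s times."""
    n = len(A)
    norm = max(sum(abs(t * a) for a in row) for row in A)
    s = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    B = [[t * a / 2 ** s for a in row] for row in A]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, B)]
        result = [[r + x for r, x in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    for _ in range(s):
        result = mat_mul(result, result)
    return result


def pdl_no_repair_numeric(T, M, lam, t):
    """Return (PDL_t, q(t)); q(t) is column 0 of exp(tA) because q(0) = e_0."""
    E = expm(no_repair_generator(T, M, lam), t)
    q = [row[0] for row in E]
    return q[M + 1], q


pdl, q = pdl_no_repair_numeric(T=10, M=2, lam=0.1, t=5.0)
print("q(t) =", q)
print("PDL_t =", pdl)
```

The components of the returned vector should sum to one, and the last component is the probability of having reached the failure state by time $t$.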

2.1 Calculation of PDL_t

From the solution provided in (3), it is easy to identify the $\mathrm{PDL}_t$ for the model: it is simply the probability of being in the failure state at time $t$, $\mathrm{PDL}_t = q_{M+1}(t)$. For most models presented here, finding an analytical form for this quantity will be intractable, or will result in an expression that is too long to include. However, the no repair model is simple enough that we can tackle the $\mathrm{PDL}_t$ calculation directly.

The calculations simplify significantly if we consider a slightly modified model. The model consists of $T$ drives without partitioning the system into the data and check drives. We run this system until the last drive fails. However, formally this system is identical to the original system with exactly $M' = T - 1$ check drives and $N' = 1$ data drive. We consider the system failed when $M + 1$ or more drives have failed. Thus, in this new framing, the quantity of interest is:

$$\mathrm{PDL}_t = \sum_{i=M+1}^{T} q_i(t) = 1 - \sum_{i=0}^{M} q_i(t)$$

and we find it by explicitly calculating $\mathbf{q} = \exp(tA)\,\mathbf{q}(0)$. The diagonalization of $A$, $S^{-1} A S = D$, where $D$ is a diagonal matrix and $S$ is invertible, can be found explicitly. Clearly, $D_{ii} = -\lambda_i = -(T - i)\lambda$ and we found the following expression for the entries of $S$:

$$S_{kl} = (-1)^{k-l} \binom{T-l}{k-l}$$

Thus $S$ is a lower-triangular matrix whose entries are binomial coefficients, up to the sign. The columns of $S$ are the right eigenvectors of $A$. The left eigenvectors of $A$ are the row vectors which are the rows of $S^{-1}$. We found the following expression for the entries of $S^{-1}$:

$$(S^{-1})_{kl} = \binom{T-l}{k-l}$$



This is again a lower-triangular matrix of binomial coeffi- 
cients. 
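The binomial expressions for $S$ and $S^{-1}$ can be checked mechanically. The sketch below, in exact integer arithmetic and with $\lambda = 1$, verifies that the two matrices are inverse to each other and that $AS = SD$ for the modified model with states $0, \ldots, T$; the function names are hypothetical.

```python
from math import comb


def binom(n, k):
    """Binomial coefficient, taken to be 0 when k < 0 (entries above the diagonal)."""
    return comb(n, k) if k >= 0 else 0


def check_diagonalization(T):
    """True if S * S_inv = I and A * S = S * D for the T-drive no repair chain."""
    n = T + 1  # states 0..T
    S = [[(-1) ** ((k - l) % 2) * binom(T - l, k - l) for l in range(n)]
         for k in range(n)]
    S_inv = [[binom(T - l, k - l) for l in range(n)] for k in range(n)]

    # Transition matrix in units of lambda: A[i][i] = -(T - i), A[i+1][i] = T - i.
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = -(T - i)
        if i + 1 < n:
            A[i + 1][i] = T - i
    D = [-(T - i) for i in range(n)]

    def mul(X, Y):
        return [[sum(X[i][m] * Y[m][j] for m in range(n)) for j in range(n)]
                for i in range(n)]

    identity = [[int(i == j) for j in range(n)] for i in range(n)]
    SD = [[S[i][j] * D[j] for j in range(n)] for i in range(n)]
    return mul(S, S_inv) == identity and mul(A, S) == SD


print(all(check_diagonalization(T) for T in range(1, 15)))
```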

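When the matrix $A$ is diagonalizable, the explicit formula for $\exp(tA)$ is:

$$\exp(tA) = \sum_{i=0}^{T} e^{-t\lambda_i}\, \mathbf{v}_i \mathbf{w}_i^T \tag{4}$$

where $\mathbf{v}_i$ is the $i$-th right eigenvector ($i$-th column of $S$) and $\mathbf{w}_i^T$ is the $i$-th left eigenvector ($i$-th row of $S^{-1}$). Since $\mathbf{q}(0) = \mathbf{e}_0 = (1, 0, 0, \ldots, 0)^T$,

$$q_k(t) = \sum_{i=0}^{T} e^{-t\lambda_i} (\mathbf{e}_k^T \mathbf{v}_i)(\mathbf{w}_i^T \mathbf{e}_0) = \sum_{i=0}^{T} e^{-t\lambda_i} S_{ki} (S^{-1})_{i0} = \sum_{i=0}^{k} (-1)^{k-i} \binom{T-i}{k-i} \binom{T}{i} e^{-(T-i)\lambda t}.$$

We note the identity

$$\binom{T-i}{k-i}\binom{T}{i} = \frac{(T-i)!}{(T-k)!\,(k-i)!} \cdot \frac{T!}{(T-i)!\,i!} = \frac{T!}{(T-k)!\,k!} \cdot \frac{k!}{(k-i)!\,i!} = \binom{T}{k}\binom{k}{i}.$$

This gives the following:

$$q_k(t) = \binom{T}{k} e^{-(T-k)\lambda t} \sum_{i=0}^{k} \binom{k}{i} (-1)^{k-i} e^{-(k-i)\lambda t} = \binom{T}{k} e^{-(T-k)\lambda t} \left(1 - e^{-\lambda t}\right)^k.$$

This is a binomial distribution $B(T, p)$, where $p = 1 - e^{-\lambda t}$ is the probability of success, as expected, because we may consider the survival of each disk as an independent Bernoulli trial. Hence,

$$\mathrm{PDL}_t = 1 - \sum_{k=0}^{M} \binom{T}{k} e^{-(T-k)\lambda t} \left(1 - e^{-\lambda t}\right)^k. \tag{5}$$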
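The binomial form of the $\mathrm{PDL}_t$ for the no repair model, one minus the probability that at most $M$ of the $T$ drives have failed by time $t$, translates directly into code. The sketch below is a minimal illustration; the function name and the example values ($N = 20$ data drives, $\lambda = 0.1$ per year, $t = 5$ years) are assumptions.

```python
from math import comb, exp


def pdl_no_repair(T: int, M: int, lam: float, t: float) -> float:
    """PDL_t = 1 - sum_{k=0}^{M} C(T, k) e^{-(T-k) lam t} (1 - e^{-lam t})^k."""
    p = 1.0 - exp(-lam * t)          # probability that one drive has failed by time t
    survive = sum(comb(T, k) * (1.0 - p) ** (T - k) * p ** k for k in range(M + 1))
    return 1.0 - survive


LAMBDA = 0.1   # assumed failure rate per drive, in 1/years
YEARS = 5.0    # assumed deployment time
N = 20         # assumed number of data drives
for M in range(1, 6):
    print(M, pdl_no_repair(T=N + M, M=M, lam=LAMBDA, t=YEARS))
```

For a fixed number of data drives the printed values cannot increase with $M$, since each additional check drive adds one drive but also one tolerated failure.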



This formula is adequate for calculations and is numerically stable. It may also be approximated by the left tail of the normal distribution $N(T(1-p), \sqrt{T p (1-p)})$, based on the Central Limit Theorem (CLT). However, we can do better. Let $\mathcal{T}$ be the time at which data is lost. Then $F(t) = \mathrm{PDL}_t = P(\mathcal{T} \le t)$ is the cumulative distribution function of $\mathcal{T}$. We find the p.d.f. of $\mathcal{T}$ using $F'(t) = -\sum_{k=0}^{M} q_k'(t)$. Using the system of differential equations satisfied by $q_k(t)$ ($q_0'(t) = -\lambda_0 q_0(t)$ and $q_k'(t) = -\lambda_k q_k(t) + \lambda_{k-1} q_{k-1}(t)$ for $k \ge 1$) we find:

$$F'(t) = \lambda_0 q_0(t) - \sum_{k=1}^{M} \left(-\lambda_k q_k(t) + \lambda_{k-1} q_{k-1}(t)\right) = \lambda_M q_M(t) = \lambda T \binom{T-1}{M} e^{-(T-M)\lambda t} \left(1 - e^{-\lambda t}\right)^M.$$

Also, $F(0) = 0$, and thus

$$\mathrm{PDL}_t = \lambda T \binom{T-1}{M} \int_0^t e^{-(T-M)\lambda s} \left(1 - e^{-\lambda s}\right)^M ds.$$

We use the substitution $u = 1 - e^{-\lambda s}$, $du = \lambda (1 - u)\, ds$, and obtain, with $u(t) = 1 - e^{-\lambda t}$,

$$\mathrm{PDL}_t = T \binom{T-1}{M} \int_0^{u(t)} (1-u)^{T-M} u^M (1-u)^{-1}\, du = \frac{T!}{M!\,(T-M-1)!} \int_0^{u(t)} (1-u)^{T-M-1} u^M\, du.$$

This means that the random variable $\mathcal{U}$ given by the transformation $\mathcal{U} = 1 - \exp(-\lambda \mathcal{T})$ has beta distribution

$$f(u; \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, u^{\alpha-1} (1-u)^{\beta-1} \tag{6}$$

with parameters $\alpha = M + 1$ and $\beta = T - M$. We note that the transformation is the c.d.f. of an exponential model. If $\mathcal{T}$ were exponentially distributed with this c.d.f. the resulting $\mathcal{U}$ would have uniform distribution on the interval $[0, 1]$. This highlights the stark difference between the uniform time to failure for a single drive, and the time to data loss for a RAID system, which has small spread around its MTTDL.
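The beta description can be cross-checked numerically against the binomial expression for the $\mathrm{PDL}_t$. The sketch below integrates the Beta($M + 1$, $T - M$) density up to $u(t) = 1 - e^{-\lambda t}$ with composite Simpson's rule (our choice of quadrature, not a method taken from the text) and compares the result with the binomial sum; all names and parameter values are illustrative.

```python
from math import comb, exp, factorial


def pdl_binomial(T, M, lam, t):
    """One minus the probability that at most M of T drives have failed by time t."""
    p = 1.0 - exp(-lam * t)
    return 1.0 - sum(comb(T, k) * (1 - p) ** (T - k) * p ** k for k in range(M + 1))


def pdl_beta(T, M, lam, t, steps=2000):
    """Beta(M+1, T-M) c.d.f. at u(t) = 1 - e^{-lam t}, by composite Simpson's rule."""
    upper = 1.0 - exp(-lam * t)
    norm = factorial(T) / (factorial(M) * factorial(T - M - 1))   # 1 / B(M+1, T-M)

    def density_kernel(u):
        return u ** M * (1.0 - u) ** (T - M - 1)

    h = upper / steps                     # `steps` must be even for Simpson's rule
    total = density_kernel(0.0) + density_kernel(upper)
    total += sum((4 if i % 2 else 2) * density_kernel(i * h) for i in range(1, steps))
    return norm * total * h / 3.0


print(pdl_binomial(12, 2, 0.1, 5.0), pdl_beta(12, 2, 0.1, 5.0))
```

The two printed numbers should agree to many digits, since both evaluate the same distribution function.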

2.2 Calculation of MTTDL 

In this paper we will develop several methods for calculating MTTDL; the following method is based on the results of Section 2.1. The inverse formula $\mathcal{T} = -\frac{1}{\lambda} \log(1 - \mathcal{U})$ allows one to compute MTTDL. The variable $\mathcal{V} = 1 - \mathcal{U}$ is beta distributed, with the parameters $\alpha$ and $\beta$ swapped (that is, with first parameter $T - M$ and second parameter $M + 1$). We find from standard sources on beta distribution that

$$E \log(\mathcal{V}) = \psi(\alpha) - \psi(\alpha + \beta) = \psi(T - M) - \psi(T + 1)$$

where $\psi$ is the digamma function. Hence,

$$\mathrm{MTTDL} = E\,\mathcal{T} = -\frac{1}{\lambda} E \log \mathcal{V} = \frac{\psi(T + 1) - \psi(T - M)}{\lambda}.$$

The digamma function is the only solution of the functional equation $\psi(x + 1) = \psi(x) + 1/x$ monotone on $(0, \infty)$ and satisfying $\psi(1) = -\gamma$, where $\gamma$ is the Euler-Mascheroni constant. Therefore, we can show easily that

$$\mathrm{MTTDL} = \frac{1}{\lambda} \sum_{k=T-M}^{T} \frac{1}{k}.$$

Yet another method uses the widely known fact that $\mathrm{MTTDL} = \int_0^{\infty} R(t)\, dt$, where $R(t) = 1 - F(t)$ is the reliability function, and it uses formula (5). This calculation is left to the reader. A more general, related approach, based on the Laplace transform is presented in the next section.
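The harmonic-sum form of the MTTDL for the no repair model is simple to evaluate. The sketch below computes it in two ways that should agree: as the sum over $k$ from $T - M$ to $T$, and as the sum of the mean holding times $1/\lambda_i$ of the chain in states $0, \ldots, M$. The function names and test values are hypothetical.

```python
from fractions import Fraction


def mttdl_no_repair(T: int, M: int, lam: float) -> float:
    """MTTDL = (1/lam) * sum_{k=T-M}^{T} 1/k for the no repair model."""
    harmonic_tail = sum(Fraction(1, k) for k in range(T - M, T + 1))
    return float(harmonic_tail) / lam


def mttdl_by_stages(T: int, M: int, lam: float) -> float:
    """Same quantity as a sum of mean holding times: the chain waits on
    average 1 / ((T - i) * lam) in state i before the next drive fails."""
    return sum(1.0 / ((T - i) * lam) for i in range(M + 1))


for M in (1, 2, 4, 8):
    print(M, mttdl_no_repair(T=64 + M, M=M, lam=0.1), mttdl_by_stages(64 + M, M, 0.1))
```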

2.3 An Analytical Approach to MTTDL 

The moments of $\mathbf{q}(t)$ are the quantities $m_k(\mathbf{q}) = \int_0^{\infty} t^k \mathbf{q}(t)\, dt$. These are vector quantities, but typically we will only be interested in the last component, $m_k(q_{M+1})$, as it is related to the distribution of the time of failure. In particular, $\mathrm{MTTDL} = m_1(q_{M+1})$. Our exact analytical technique to calculate the moments is based on the Laplace transform: $\hat{\mathbf{q}}(z) = \int_0^{\infty} e^{-zt} \mathbf{q}(t)\, dt$. We recall that the Laplace transform is the moment generating function:

$$\hat{\mathbf{q}}(z) = \sum_{k=0}^{\infty} (-1)^k \frac{z^k}{k!}\, m_k(\mathbf{q}). \tag{7}$$

We can obtain a second expression for $\hat{\mathbf{q}}(z)$ by considering the resolvent of the matrix $A$: $R(z; A) = (zI - A)^{-1}$. The quantity $z$ is complex, and $R(z; A)$ is a complex matrix-valued function meromorphic in the entire complex plane. Its poles are exactly the eigenvalues of $A$. A well-known identity in operator theory relates the Laplace transform to the resolvent and the initial condition:

$$\hat{\mathbf{q}}(z) = R(z; A)\, \mathbf{q}(0).$$

The resolvent is a rational function, and in applications to continuous-time Markov chains, 0 is an eigenvalue of $A$. The resolvent therefore has a pole at $z = 0$ of some order $\nu$. For this reason it is convenient to consider the Laurent series of $R(z; A)$ at $z = 0$:

$$R(z; A) = \sum_{k=-\nu}^{\infty} z^k R^{(k)}.$$

The matrices $R^{(k)}$ are constant. When $z = 0$ is a simple eigenvalue (common case) then $\nu = 1$ and $R^{(-1)}$ (the residue) admits an expression in terms of the left and right eigenvectors of $A$ with eigenvalue 0. Let $\mathbf{u}^T A = 0$ and $A \mathbf{v} = 0$, scaled so that $\mathbf{u}^T \mathbf{v} = 1$. Then $R^{(-1)} = \mathbf{v} \mathbf{u}^T$ is a spectral projection on the eigenspace of 0. When $\mathbf{u}^T \mathbf{q}(0) = 0$, as is the case in the models presented here, we have

$$\hat{\mathbf{q}}(z) = \sum_{k=0}^{\infty} z^k R^{(k)} \mathbf{q}(0).$$

Comparing this equation with (7), we obtain an expression for the moments of $\mathbf{q}$ in terms of the Laurent series of the resolvent: $m_k(\mathbf{q}) = (-1)^k k!\, R^{(k)} \mathbf{q}(0)$. This is our main device to calculate means and variances of the time of failure. These formulas are convenient to use with computer algebra systems (CAS).

The relationship between the moments of the time of failure $\mathcal{T}$ and the resolvent can be explained easily. If $r$ is the initial state ($q_j(0) = \delta_{jr}$), and the failure state is $s$, then the probability $F(t) = \mathrm{PDL}_t$ of transitioning to the failure state before time $t$ is the entry $(e^{tA})_{sr}$ of the fundamental matrix. Therefore, the entry $R_{sr}(z; A) = (R(z; A))_{s,r}$ of the resolvent is the Laplace transform of the $\mathrm{PDL}_t$. The function $F(t)$ is the c.d.f. of the time to failure. The $k$-th moment of this distribution is $\int_0^{\infty} t^k\, dF(t)$. Stieltjes integration by parts formula yields:

$$m_k(\mathcal{T}) = \int_0^{\infty} t^k\, dF(t) = -\int_0^{\infty} t^k\, d(1 - F(t)) = -t^k (1 - F(t)) \Big|_0^{\infty} + \int_0^{\infty} (1 - F(t))\, k t^{k-1}\, dt = k\, m_{k-1}(1 - q_{s,r}(t)).$$

The Laplace transform of $1 - q_{s,r}(t)$ is $1/z - \hat{q}_{s,r}(z) = 1/z - R_{s,r}(z; A)$. Therefore, there is an explicit expression of the moments of the time to failure in terms of the Laurent series of the resolvent:

$$m_k(\mathcal{T}) = (-1)^k k!\, R^{(k-1)}_{s,r}. \tag{8}$$

2.4 Summary of Results



In Section 2.1 we showed that if $\mathcal{T}$ is the random variable representing the time to failure, then the random variable

$$\mathcal{U} = 1 - \exp(-\lambda \mathcal{T})$$

is beta distributed with parameters $\alpha = M + 1$ and $\beta = N$. This provides a complete probabilistic description of $\mathcal{T}$. We derived the explicit formula

$$\mathrm{PDL}_t = 1 - T \binom{T-1}{M} \sum_{i=0}^{M} \frac{(-1)^{M-i}}{T-i} \binom{M}{i}\, e^{-t(T-i)\lambda}.$$



In Section 2.2 we derived an explicit formula for MTTDL:

$$\mathrm{MTTDL} = \frac{1}{\lambda} \sum_{i=0}^{M} \frac{1}{T-i} = \frac{1}{\lambda} \sum_{k=T-M}^{T} \frac{1}{k}.$$

We note that for large $T$ we have the following approximation:

$$\mathrm{MTTDL} = \frac{1}{\lambda} \sum_{k=0}^{M} \frac{1}{1 - k/T} \cdot \frac{1}{T} \approx \frac{1}{\lambda} \int_0^{M/T} \frac{1}{1-u}\, du = -\frac{1}{\lambda} \log\left(1 - \frac{M}{T}\right).$$

This formula demonstrates that we may increase MTTDL to an arbitrarily large value by increasing the ratio of check drives to data drives to infinity. The total number of drives, $T$, grows exponentially with the target MTTDL. However, MTTDL grows linearly with $M$ if $M \ll T$.

Thus, when there are relatively few check drives as compared to the number of data drives, the economical way to double MTTDL is clear: double the number of check drives.

We can also numerically compute $\mathrm{PDL}_t$ for this model using (3). We will use the value $\lambda = 1/(10\ \text{years})$ for numerical computation, and assume that the expected deployment time for the system is five years. We believe this value for $\lambda$ is realistic, but perhaps conservative, based on the discussion of real world drive failure rates in [?]. We do not expect to obtain numerical results that are particularly exact, and thus will not dwell upon the value of such parameters used here. Instead, we hope to better understand the effect of using additional check drives on the reliability of the system by examining numerical results qualitatively.

Figure 2 shows PDL5 under a no repair model as a function of $N$ for five values of $M$. Notice that to maintain a particular level of reliability (PDL5 value), more check drives are required as the number of data drives increases.

3 Model 2: Individual Drive Repair

Let us consider a model of RAID reliability as in Section 2, but now we allow failed hard drives to be repaired one at a time. Other authors have also considered this model; see [10], [?], [?], for example, or [?] for a similar model.

The inclusion of repair yields a new Markov chain, shown in Figure 3. The modeling of drive failures is the same as in Section 2. When a drive is repaired the system moves from state $i$ to state $i - 1$. If drives are repaired at a constant rate $\mu$, independent of the number of failed drives, the system moves from state $i$ to $i - 1$ with effective repair rate $\mu_i = \mu$.

Figure 3: Markov chain for the individual repair model.
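The individual repair chain just described can be explored numerically. The sketch below builds its transition matrix (failures move the system from state $i$ to $i + 1$ at rate $(T - i)\lambda$, repairs from $i$ to $i - 1$ at rate $\mu$) and evaluates the probability of reaching the failure state with a plain scaling-and-squaring matrix exponential. The numerical method, the function names, and the parameter values (8 data drives, $\lambda = 1/(10\ \text{years})$, $\mu = 1/(6\ \text{hours})$, five years of deployment) are assumptions for illustration. Probabilities below roughly $10^{-9}$ are beyond the accuracy of this simple method.

```python
import math


def individual_repair_generator(N, M, lam, mu):
    """Transition matrix (columns sum to zero) on states 0..M+1; state i means
    i failed drives, and state M+1 is the absorbing failure state."""
    T = N + M
    n = M + 2
    A = [[0.0] * n for _ in range(n)]
    for i in range(M + 1):
        fail = (T - i) * lam
        A[i + 1][i] += fail
        A[i][i] -= fail
        if i >= 1:                     # repairs are possible once a drive has failed
            A[i - 1][i] += mu
            A[i][i] -= mu
    return A


def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def state_probabilities(A, t, terms=30):
    """q(t) = exp(tA) e_0: Taylor series of exp(tA / 2^s), squared s times."""
    n = len(A)
    norm = max(sum(abs(t * a) for a in row) for row in A)
    s = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    B = [[t * a / 2 ** s for a in row] for row in A]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, B)]
        result = [[r + x for r, x in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    for _ in range(s):
        result = mat_mul(result, result)
    return [row[0] for row in result]


LAM = 0.1                  # assumed failure rate, 1/(10 years), in 1/years
MU = 24 * 365.25 / 6       # assumed repair rate, 1/(6 hours), converted to 1/years
pdl5 = {M: state_probabilities(individual_repair_generator(8, M, LAM, MU), 5.0)[M + 1]
        for M in range(3)}
print(pdl5)
```

Setting `mu = 0` reduces the matrix to the no repair model of Section 2, which gives a convenient sanity check for the construction.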

Figure 2: PDL5 for the no-repair model, plotted against the number of data drives $N$.

3.1 Simple Case: RAID 4/5

A simple case of this model is RAID 4/5 which, in essence, has $N$ data drives and $M = 1$ check drive, and the failure of any two drives is fatal. The states in this model are 0, 1 and 2, and the transition matrix is:

$$A = \begin{pmatrix} -(N+1)\lambda & \mu & 0 \\ (N+1)\lambda & -N\lambda - \mu & 0 \\ 0 & N\lambda & 0 \end{pmatrix} \tag{9}$$

The entry $R_{2,0}(z)$ of the resolvent of $A$ is of interest:

$$R_{2,0}(z) = \frac{(N^2 + N)\lambda^2}{(N^2 + N)\lambda^2 z + ((2N+1)\lambda + \mu) z^2 + z^3}$$

By the Laurent series of $R_{2,0}(z)$ and (8) we find:

$$m_1(\mathcal{T}) = \frac{(2N+1)\lambda + \mu}{N(N+1)\lambda^2}, \qquad m_2(\mathcal{T}) = \frac{(6N^2 + 6N + 2)\lambda^2 + (8N + 4)\mu\lambda + 2\mu^2}{(N^4 + 2N^3 + N^2)\lambda^4}.$$

This result for $m_1(\mathcal{T}) = \mathrm{MTTDL}$ is consistent with Plank in [10]. The variance may be computed using the formula $\mathrm{var}(\mathcal{T}) = m_2(\mathcal{T}) - m_1(\mathcal{T})^2$ and is useful when applying techniques such as the Tchebycheff inequality to estimate the confidence interval for $\mathcal{T}$. After some simplifications

$$\mathrm{var}(\mathcal{T}) = \frac{(2N^2 + 2N + 1)\lambda^2 + (4N + 2)\mu\lambda + \mu^2}{(N^4 + 2N^3 + N^2)\lambda^4}.$$


Our technique generalizes easily to more check drives (e.g. 2, 3 and 4), but the expressions are too large to include in this paper. It is worth noting that moment calculations do not require eigenvalues of $A$. In contrast, calculating $\mathrm{PDL}_t$, or equivalently, $\exp(tA)$, is typically performed via diagonalization of $A$, and is easily done using numerical techniques. Limited theoretical results can be obtained, but they are beyond the scope of this paper.




3.2 General Case

For arbitrary $M$, the transition matrix is given by:

$$A = \begin{pmatrix} -\lambda_0 & \mu_1 & & & & \\ \lambda_0 & -(\lambda_1 + \mu_1) & \mu_2 & & & \\ & \lambda_1 & -(\lambda_2 + \mu_2) & \ddots & & \\ & & \ddots & \ddots & \mu_M & \\ & & & \lambda_{M-1} & -(\lambda_M + \mu_M) & 0 \\ & & & & \lambda_M & 0 \end{pmatrix}$$

with $\lambda_j = (T - j)\lambda$ and $\mu_j = \mu$.

As in Section 2.4 we can numerically compute PDL5 for this model. We use a repair rate of $\mu = 1/(6\ \text{hours})$ for our calculations.

Figure 4 shows PDL5 as a function of $N$ for five values of $M$. The y-axis (PDL5) is on a logarithmic scale in this graph; each curve on the graph represents a particular value of $M$. First, notice that these curves are spaced evenly apart for PDL5 on a logarithmic scale. This indicates that a RAID system under the individual repair model with $M + 1$ check drives is exponentially better than a RAID system with $M$ check drives and all other parameters the same. That is, $\mathrm{PDL}_5 \sim c_1^{M}$ where $c_1$ is a constant less than one. Additionally, for a fixed $M$, the effect of increasing $N$ is relatively small. These observations have direct implications for the design of RAID systems. Adding an additional check drive to a RAID will dramatically increase the reliability of that system, whereas an additional data drive will only decrease the reliability of the system slightly. Therefore, the largest RAID systems are able to achieve the best reliability at the lowest cost per byte stored.

Figure 4: PDL5 for the individual repair model.


4 Model 3: Simultaneous Repair 




As a revision to the individual repair model, we note that it is unrealistic for hard drives to be repaired one by one at some fixed rate. Instead it is more likely that if one or more drives have failed, a repairman would be notified, and when he arrived to fix the drives, he would fix all failed drives at once. This new model can be represented with the Markov chain shown in Figure 5. For this model,

$$A = \begin{pmatrix} -\lambda_0 & \mu_1 & \mu_2 & \cdots & \mu_M & 0 \\ \lambda_0 & -(\lambda_1 + \mu_1) & & & & \\ & \lambda_1 & -(\lambda_2 + \mu_2) & & & \\ & & \ddots & \ddots & & \\ & & & \lambda_{M-1} & -(\lambda_M + \mu_M) & \\ & & & & \lambda_M & 0 \end{pmatrix}$$

with $\lambda_m = (T - m)\lambda$ and $\mu_m = \mu$. Notice for $M = 1$ this is the same model as the simple RAID 4/5 case of Section 3.1, and has the same MTTDL.

Figure 5: Markov chain for the simultaneous repair model.
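The difference between the two repair policies is confined to where the repair transitions lead. A minimal sketch, with a hypothetical function name and illustrative rates, that builds either transition matrix and confirms that the two coincide for $M = 1$:

```python
def repair_generator(N, M, lam, mu, simultaneous):
    """Transition matrix on states 0..M+1 (columns sum to zero).
    From state i (1 <= i <= M) a repair leads to state 0 if `simultaneous`
    is true (all failed drives replaced at once), and to state i - 1 otherwise."""
    T = N + M
    n = M + 2
    A = [[0.0] * n for _ in range(n)]
    for i in range(M + 1):
        fail = (T - i) * lam
        A[i + 1][i] += fail
        A[i][i] -= fail
        if i >= 1:
            target = 0 if simultaneous else i - 1
            A[target][i] += mu
            A[i][i] -= mu
    return A


# Assumed illustrative rates, in 1/years.
same_for_m1 = (repair_generator(8, 1, 0.1, 1461.0, True)
               == repair_generator(8, 1, 0.1, 1461.0, False))
print(same_for_m1)
```

For $M > 1$ the matrices differ only in the row that receives the repair rate from states $2, \ldots, M$.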



4.1 Results 

The effects on predicted reliability of the RAID as a result 
of this change to the model are negligible. For example, 
with $M = 5$, the PDL5 for the simultaneous repair model is
0-3% lower than for the individual repair model, and the 
effect grows linearly with N. The same relationship holds 
for other values of M with a smaller constant of propor- 
tionality for smaller M. For large M this effect might be 
significant but the reliability model is not sensitive to this 
modification. 



5 Model 4: Imperfect Repair 

Say you are a small company with a RAID system, and 
one drive or system fails. You call in the appropriate em- 
ployee, but this person may not be an expert in RAID. He 
accidentally swaps out the wrong hard drive. Now you 
effectively have a RAID system with two failed drives in- 
stead of one. 

Here we will attempt to capture the effects of human error on the reliability of RAID systems. We will build on the model discussed in Section 4, using the same Markov chain (Figure 5) and transition matrix $A$, but will use different effective failure and repair rates. We suggest that when hard drives fail there is a probability $p$ that in servicing those drives some other hard drive will be damaged and the already failed drives will not be repaired; there is a probability $1 - p$ that the failed drives will successfully be repaired. Therefore, the effective failure and repair rates are $\lambda_0 = T\lambda$, $\lambda_j = (T - j)\lambda + \mu p$ for $j > 0$, and $\mu_j = \mu(1 - p)$.
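The effective rates of the imperfect repair model are simple to tabulate. In the sketch below a service event still occurs at rate $\mu$ in every state with at least one failed drive; with probability $p$ it acts as an additional failure and with probability $1 - p$ as a repair. The function name and the example values are hypothetical.

```python
def imperfect_repair_rates(T, M, lam, mu, p):
    """Effective rates for states 0..M: lambda_0 = T lam,
    lambda_j = (T - j) lam + mu p for j > 0, and mu_j = mu (1 - p) for j > 0."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    failure = [(T - j) * lam + (mu * p if j > 0 else 0.0) for j in range(M + 1)]
    repair = [0.0] + [mu * (1.0 - p)] * M     # nothing to repair in state 0
    return failure, repair


# Assumed illustrative values: 10 drives, 2 check drives, rates in 1/years.
failure, repair = imperfect_repair_rates(T=10, M=2, lam=0.1, mu=1461.0, p=0.05)
print(failure)
print(repair)
```

With $p = 0$ the rates reduce to those of the simultaneous repair model.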



5.1 Results 

Figure 6 shows the effects on PDL5 of considering imperfect repair. Notice that even for small $p$, imperfect repair decreases the reliability of the system by several orders of magnitude. Doubling $p$ decreases the reliability by at least one further order of magnitude. For larger $M$ the effect is more pronounced with a decrease in reliability of as much as 10 orders of magnitude. Someone who has designed their RAID system without considering the perils of service would need to add at least one if not two or more additional check drives to their RAID to maintain the same expected system reliability in the face of service hazards.

Figure 6: PDL5 for a model with imperfect repair for a variety of $M$ and $p$. Notice that for $p = 0$, this is the simultaneous repair model.

6 Model 5: Sector Errors 

Another major consideration in data reliability is the occurrence of irrecoverable read errors. In particular, if a RAID system is rebuilding after $M$ drive failures, and encounters an irrecoverable read error on one of the remaining working disks, it will not be able to rebuild that byte for the RAID system. As mentioned in Section 1, irrecoverable read errors occur once in every $10^{14}$ bits, and thus are a common occurrence. Here, we will restrict our attention to the particular case of sector errors, and assume that although sector errors are common it is unlikely that the same sector will fail on two or more disks simultaneously.

We include the effects of sector errors on reliability by considering a two-dimensional Markov chain where the state $ij$ signifies that $i$ drives have failed, and $j$ drives have sector errors, and restrict our states to $i + j \le M + 1$. There is a special FAIL state which indicates that data has been lost and corresponds to either $M + 1$ failed drives and no working drives with sector errors or $M$ failed drives and one or more working drives with a sector error. We denote the case where $i + j = M + 1$ as $ij^+$, meaning that $i$ drives have failed and $j$ or more drives have sector errors. As in the simultaneous repair model of Section 4, drives fail at rate $\lambda$, and are simultaneously repaired at rate $\mu$. In addition, clean drives will develop sector failures at rate $\lambda'$, and we scrub the drives to remove the sector errors at rate $\mu'$. This model was studied in [5], and the approach here is similar but generalized to an arbitrary number of check drives. Figure 7 depicts the Markov chain for the $M = 2$ case, and Table 1 gives the transition rates between states in the general case.

Similar to our previous methodology, we can consider a vector $\mathbf{q}(t)$ that gives the probability of being in each state of our Markov chain as a function of time. Since we have been considering states indexed by two numbers, it is necessary to relabel our states in some one dimensional manner, but the convention by which we do this is unimportant. We can then use the transition rates given in Table 1 to write a transition matrix. Even for small $M$, this matrix is too large to include here. One can then calculate $\mathbf{q}(t)$ using (3). For numerical calculations, we assume a sector failure rate of $\lambda' = 1/(2\ \text{days})$ and a scrub rate of $\mu' = 1/(6\ \text{hours})$, except where otherwise noted. This scrub rate is unrealistically high; we will find that it does not matter.

Figure 7: Markov chain for a model of RAID reliability that considers the effects of sector errors for the $M = 2$ case. In this model, state $ij$ indicates $i$ failed drives and $j$ working drives with sector errors.

Table 1: Table of transition rates for the sector failure model. We write $(i, j)$ instead of $ij$. Type 1 is failure of a drive without sector error. Type 2 is replacement of all failed drives. Type 3 is sector failure. Type 4 is removal of all sector errors (scrubbing). Type 5 is failure of a drive with a sector error. Type 6 is a failure of a drive in the special case where $i + j = M + 1$. Since this case indicates that $j$ or more drives have sector errors, the failure of any drive regardless of whether that drive has sector errors moves the system to the state $(i + 1, j - 1)$.
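The relabeling of the two-dimensional states into a one-dimensional index can be sketched as follows. The enumeration reflects our reading of the state description in the text: all pairs $(i, j)$ with $i + j \le M + 1$, with the pairs $(M + 1, 0)$ and $(M, 1)$ merged into a single FAIL state. The ordering and the function name are arbitrary choices, and the transition rates themselves are not reproduced here.

```python
def sector_error_states(M):
    """Enumerate states (i, j): i failed drives, j working drives with sector
    errors, i + j <= M + 1, followed by one absorbing "FAIL" state.
    (M + 1, 0) and (M, 1) are not listed separately; both count as FAIL."""
    states = []
    for i in range(M + 1):                 # i = M + 1 is already FAIL
        for j in range(M + 2 - i):
            if (i, j) == (M, 1):
                continue                   # M failed drives plus a sector error
            states.append((i, j))
    states.append("FAIL")
    index = {state: k for k, state in enumerate(states)}
    return states, index


states, index = sector_error_states(2)
print(states)
```

Under this reading the chain has $(M + 1)(M + 4)/2$ states in total, which is the dimension of the transition matrix to be assembled from the rate table.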



Figure 8: PDL5 for a model with sector errors. This graph 
compares no scrubbing to scrubbing once an hour for M = 
2,3, and 4. 

Figure 8 shows the PDL5 for this model as a function of $N$ for various $M$ and $\mu'$. Notice that for a fixed $\mu'$, this graph is qualitatively similar to Figure 4, the graph of PDL5 for the individual repair model. However this graph shows that the consideration of sector errors yields an estimate of PDL5 that is several orders of magnitude higher than the model proposed in Section 4.

[5] discusses extensively the benefits of shorter scrubbing intervals (increased $\mu'$), finding that more frequent scrubbing can substantially increase the MTTDL of the system and thus make it more reliable. We find that scrubbing more frequently can indeed increase the reliability of the system, but the effect is much less dramatic than adding another check drive. Indeed, for the sector failure rate used, it was necessary to increase the scrub rate to once an hour to produce a discernible effect on the graph in Figure 8.

6.1 Sector Errors and Imperfect Repair 

We can update our model of sector failures to also capture
the effect of imperfect repair discussed in Section 5 simply
by altering the transition rates of the Markov model
proposed in this section. Table 2 shows the transition rates
for the sector errors and imperfect repair model. Figure 9
shows the PDL5 for this model with p = 0.05. Notice
that the reliability estimates under this model are up to
6 orders of magnitude worse than for the model of sector
errors without considering imperfect repair, and up to
14 orders of magnitude worse than the model without either
sector failures or imperfect repair, for the values of M
shown.

Table 2: Table of transition rates for the sector failure
model with imperfect repair. As compared to Table 1,
there are a few differences. Types 1, 5 and 6 of Table 1
split into two types, 1a-b, 5a-b and 6a-b respectively, in
order to account for imperfect repair. The second terms of
cases 1b and 5b represent the impact of imperfect repair,
which does not affect transitions from states where i = 0;
the fractions (T - i - j)/(T - i) and j/(T - i) represent an
erroneous repair of a functional drive without/with sector
errors, respectively.

Figure 9: PDL5 for a model with sector errors and
imperfect repair with p = 0.05.



7 Delay of Service 




[Figure 10 legend: M = 3, 4, 5, each plotted for $\mu$ = 1/(6 hours) and $\mu$ = 1/(1 week).]



We have just seen the hazards of service to the reliability
of the system, so perhaps a way to mitigate these hazards
is to plan to service the RAID system less frequently.
Section 2 took this idea to the extreme, where we considered
a system in which there is no service to hard drives.
Here we will consider a system where service happens,
but might be delayed. This could correspond to a situation
where the RAID system is only serviced on weekends,
when there are fewer users and thus it might be more
convenient for the system to be tied up in rebuild.

So far, we have modeled the time to repair as an exponentially
distributed random variable with rate $\mu$ = 1/(6 hours).
For a simple model of delayed repair, we will consider the
effects on reliability of decreasing the rate of repair in the
model proposed in Section 6.1 that considers sector errors
and imperfect repair.

Figure 10: PDL5 for a model with sector errors and
imperfect repair. This figure compares a mean time to
repair of three hours to 1 week for M = 3, 4, 5.

Figure 10 compares the reliability of a RAID system
that has a mean time to repair of 3 hours (the base model)
to one with a mean time to repair of 1 week (delayed
service) for M = 3, 4, 5. The delay of service has a dramatic
impact on the reliability of the system, particularly
for large M. If one intended to run a RAID system with
delayed service, one would need to add one, two, or perhaps
more additional check drives to compensate for the
reduced reliability.
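
To get a rough feel for how strongly the repair rate drives reliability, the sketch below computes the MTTDL of a much simpler model than the one behind Figure 10: a plain birth-death chain with no sector errors and no imperfect repair. The assumptions are ours and not taken from the text: T = N + M drives each fail at rate `lam`, each failed drive is repaired independently at rate `mu`, and data is lost once M + 1 drives are down. The function name and the values of T and `lam` are hypothetical.

```python
def mttdl_birth_death(total_drives, check_drives, lam, mu):
    """Mean time to reach check_drives + 1 failed drives, starting from 0.

    Assumed model (a simplification, not the sector-error model of
    Section 6.1): in state i the failure rate is (T - i) * lam and the
    repair rate is i * mu.
    """
    expected = 0.0    # mean time to reach state i + 1 from state 0
    step_time = 0.0   # mean time to move from state i - 1 to state i
    for i in range(check_drives + 1):
        up = (total_drives - i) * lam
        down = i * mu
        # First-step analysis: e_i = (1 + down * e_{i-1}) / up.
        step_time = (1.0 + down * step_time) / up
        expected += step_time
    return expected


lam = 1.0e-5  # hypothetical per-drive failure rate, per hour
for m in (3, 4, 5):
    prompt = mttdl_birth_death(100, m, lam, 1.0 / 6.0)     # 6-hour repair
    delayed = mttdl_birth_death(100, m, lam, 1.0 / 168.0)  # 1-week repair
    print(m, prompt, delayed)
```

For M = 1 the recursion reproduces the closed-form first moment quoted later for the one-check-drive model with h = 0, which is a useful sanity check. Comparing the delayed value for M + 1 or M + 2 against the prompt value for M gives a sense of how many extra check drives a slower repair rate costs in this simplified setting.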
8 Model 6: Delay Systems

So far, we have modeled the time to repair and rebuild
broken drives as a single, exponentially distributed random
variable with rate $\mu$. One can capture more nuance
of the repair mechanism by modeling part of this process
as a fixed time event requiring time h to complete. This
might capture the significant fixed time required to repair
and rebuild a RAID system, whereas the initiation of such
service might happen at a rate $\mu$. We call this fixed time
event a delay. For modern RAID systems, due to the size
and the number of drives, repair constitutes a significant
portion of the system's usable life and thus delays may
substantially impact our reliability model.

The mathematics of delay systems is substantially more 
complex than the models previously discussed. To estab- 
lish the theory, we begin with a toy example of a delay 
differential equation (DDE): 



\[ \frac{dy}{dt} = a\,y(t) + b\,H(t-1)\,y(t-1), \qquad 0 < t < \infty, \]



where $H(t) = \int_{-\infty}^{t} \delta(t')\,dt'$ is the Heaviside step function.
Let $\hat{y}(s) = \int_0^{\infty} e^{-st} y(t)\,dt$ be the Laplace transform of $y(t)$.
Using the usual calculus of Laplace transforms (i.e. integration
by parts for derivatives and change of variables
for shifted arguments), we obtain $s\hat{y}(s) - y(0) = a\hat{y}(s) + b e^{-s}\hat{y}(s)$.
Thus, $\hat{y}(s) = y(0)/(s - a - b e^{-s})$. Let $y(0) = 1$
be the initial condition. Using the inverse Laplace transform
(Mellin inversion formula) we get



\[ y(t) = \frac{1}{2\pi i} \lim_{\rho \to \infty} \int_{\sigma - i\rho}^{\sigma + i\rho} \frac{e^{st}\,ds}{s - a - b e^{-s}}, \]



where $\sigma$ is chosen so that all poles of the integrand lie to
the left of the line $\Re s = \sigma$, and is arbitrary otherwise.
The poles of the integrand, i.e. the roots of the equation
$s - a = b e^{-s}$, are the characteristic numbers of the
problem. The roots of the equation $s e^{s} = z$ define the
Lambert W function, which is a multi-valued function in
the complex domain. Our equation is $(s - a)e^{s} = b$ or
$(s - a)e^{s-a} = b e^{-a}$. Hence, the characteristic numbers
are $s = a + W(b e^{-a})$. Provably, there are countably
infinitely many values of $s$. If the path of integration can be
replaced with a large loop containing all characteristic numbers,
and if all characteristic numbers are simple roots
of the equation (all roots are simple unless $b e^{-(a-1)} = -1$),
then, using the residue calculus, we find the formal solution:

\[ y(t) = \sum_{j} \frac{e^{s_j t}}{1 + b e^{-s_j}}, \]

where the summation is over all characteristic numbers
$\{s_j\}$. The full mathematical analysis of a system with
delay is subtle (such as convergence of the formal series
above) and will not be attempted here. It should be noted
that numerical integration of DDEs does not present any
difficulty, and numerical results can be readily obtained.
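
The toy example can be checked numerically in a few lines. The sketch below is our own illustration, not code from the paper: it computes the characteristic numbers $s = a + W_k(b e^{-a})$ for $b > 0$ (Newton's method for the real branch, the fixed-point iteration $w \leftarrow \log z + 2\pi i k - \log w$ for the complex branches), sums the truncated formal series, and compares it with a Heun integration of the DDE and with the exact method-of-steps solution on $0 \le t \le 2$. The coefficients `a`, `b`, the truncation `kmax`, and all function names are hypothetical choices.

```python
import cmath
import math


def characteristic_numbers(a, b, kmax):
    """Roots of s - a = b*exp(-s), i.e. s = a + W_k(b*exp(-a)), for b > 0."""
    z = b * math.exp(-a)
    # Real root (principal branch): Newton's method on w*exp(w) = z.
    w = math.log1p(z)
    for _ in range(60):
        w -= (w * math.exp(w) - z) / (math.exp(w) * (1.0 + w))
    roots = [complex(a + w)]
    # Complex branches k = 1..kmax, plus their complex conjugates.
    for k in range(1, kmax + 1):
        shift = math.log(z) + 2j * math.pi * k
        w = shift
        for _ in range(200):
            w = shift - cmath.log(w)
        roots += [a + w, (a + w).conjugate()]
    return roots


def series_solution(t, a, b, roots):
    """Truncated formal series: sum of exp(s_j t) / (1 + b exp(-s_j))."""
    total = sum(cmath.exp(s * t) / (1.0 + b * cmath.exp(-s)) for s in roots)
    return total.real


def integrate_dde(a, b, t_end, steps_per_unit=1000):
    """Heun's method for y' = a y(t) + b H(t-1) y(t-1), y(0) = 1."""
    dt = 1.0 / steps_per_unit
    m = steps_per_unit                 # the unit delay, measured in steps
    y = [1.0]
    for n in range(round(t_end * steps_per_unit)):
        delayed = n >= m               # the whole step lies in t >= 1
        d0 = y[n - m] if delayed else 0.0
        d1 = y[n + 1 - m] if delayed else 0.0
        f0 = a * y[n] + b * d0
        f1 = a * (y[n] + dt * f0) + b * d1
        y.append(y[n] + 0.5 * dt * (f0 + f1))
    return y


def exact_solution(t, a, b):
    """Method-of-steps solution, valid for 0 <= t <= 2."""
    y = math.exp(a * t)
    if t > 1.0:
        y += b * (t - 1.0) * math.exp(a * (t - 1.0))
    return y


a, b = -1.0, 0.5   # hypothetical coefficients with b > 0
roots = characteristic_numbers(a, b, kmax=60)
print(exact_solution(2.0, a, b),
      series_solution(2.0, a, b, roots),
      integrate_dde(a, b, 2.0)[-1])
```

The series is evaluated at t = 2 because its terms decay only algebraically in the branch index, and faster for larger t; near t = 0 many more terms would be needed.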

To obtain a model of RAID reliability, we must study 
systems of DDEs. Here, we will consider only the follow- 
ing limited form: 



\[ \frac{d}{dt}\,q(t) = B\,q(t) + C\,H(t-h)\,q(t-h) \tag{10} \]



where $0 < t < \infty$, $B$ and $C$ are $n \times n$ matrices and $q(t)$ is a
vector-valued function with $n$ entries. We take the Laplace
transform and obtain:

\[ z\hat{q}(z) - q(0) = B\hat{q}(z) + \exp(-zh)\,C\hat{q}(z) \]

where $\hat{q}(z)$ is the Laplace transform of $q(t)$. Hence, the
Laplace transform of the solution is

\[ \hat{q}(z) = (zI - B - \exp(-zh)\,C)^{-1} q(0) \]

where $R(z;B,C,h) = (zI - B - \exp(-zh)\,C)^{-1}$ is a complex,
matrix-valued, meromorphic function which replaces
the ordinary resolvent for a non-delay linear system.
We note that for $h = 0$ we have $R(z;B,C,h) = R(z;A)$
where $A = B + C$ (the right-hand side is the ordinary
resolvent). The theory of moments of the solution
discussed in Section 2.3 carries through to the delay
case, and in particular involves only the Laurent series of
$R(z;B,C,h)$ at $z = 0$. The computation of the probabilities,
i.e. the vector $q(t)$, presents similar theoretical difficulties
as the toy example in the previous section. Numerical
calculations are straightforward. An analytical
approach must deal with the poles of $R(z;B,C,h)$, which
are the roots of the transcendental characteristic equation
$\det(zI - B - \exp(-zh)\,C) = 0$.

8.1 Delay with One Check Drive 

Now we will modify the individual repair model for
RAID 4/5 discussed in Section 3.1 to include delay. We
do so by assuming that disk repair takes no time (replacement
of the drive, reconstruction using data on functional
drives, etc.), but after that we simply wait h units of time
before adding the drive to the RAID. This is equivalent
to assuming that repair takes time h but that drives cannot
fail during repair and reconstruction. This model is not realistic
and serves only as an illustration of our approach.
This model yields the system of differential equations:



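\begin{align}
\dot{q}_0(t) &= -(N+1)\lambda\,q_0(t) + \mu H(t-h)\,q_1(t-h), \tag{11}\\
\dot{q}_1(t) &= (N+1)\lambda\,q_0(t) - (N\lambda+\mu)\,q_1(t), \tag{12}\\
\dot{q}_2(t) &= N\lambda\,q_1(t). \tag{13}
\end{align}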







Figure 11: RAID with M = 1, T = 2, $\lambda = 0.01$ and $\mu = 0.01$
(both have units of inverse time); delayed repair of
h = 300 time units. MTTDL is the area above the graph
of $q_2(t)$ for $0 < t < \infty$. According to our calculations, this
area can be made arbitrarily large by increasing the delay (h).



The delay term represents disks for which repair started
at time $t - h$ and are coming on-line at time $t$. This system
is of the form (10), where

\[ B = \begin{pmatrix} -(N+1)\lambda & 0 & 0 \\ (N+1)\lambda & -(N\lambda+\mu) & 0 \\ 0 & N\lambda & 0 \end{pmatrix}, \qquad C = \begin{pmatrix} 0 & \mu & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}. \]

















The first moment of the time of failure is:

\[ m_1 = \frac{(2N+1)\lambda + \mu}{(N^2+N)\lambda^2} + \frac{\mu h}{N\lambda}. \]



Clearly, when $h = 0$, we reproduce the result of Section 3.1.
Unexpectedly, positive delay results in a larger
first moment $m_1$. By delaying the drive replacement we could in
principle increase the mean time to failure indefinitely.
This result is counterintuitive, and indicative of how unrealistic
this model is. In simulations, the state vector $q$
exhibits oscillatory behavior, as shown in Figure 11, which
is typical when studying delay phenomena. The value of
$\mathrm{PDL}_t = q_2(t)$ may also be obtained in principle via the
inverse Laplace transform, but due to the implicit dependency
on the roots of the characteristic equation, it is not
very tractable analytically.
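
A short simulation makes the counterintuitive growth of the mean time to failure concrete. The sketch below is an illustration under our own assumptions: it integrates equations (11)-(13) with Heun's method on a grid aligned with the delay (so it needs h to be at least one step), estimates the MTTDL as the area above $q_2(t)$, and compares it with $((2N+1)\lambda+\mu)/((N^2+N)\lambda^2) + \mu h/(N\lambda)$. That comparison value is our own evaluation of the Laplace transform at $z = 0$ and should be read as an assumption; the function names, step size and time horizon are hypothetical. The parameters follow the caption of Figure 11.

```python
def simulate_delay_raid(n_data, lam, mu, h, t_end, dt):
    """Heun integration of equations (11)-(13); returns (q0, q1, q2)."""
    m = round(h / dt)               # delay expressed in steps, must be >= 1
    steps = round(t_end / dt)
    q0, q1, q2 = [1.0], [0.0], [0.0]
    fail0 = (n_data + 1) * lam      # rate of the first drive failure
    fail1 = n_data * lam            # rate of a second, fatal failure

    def rhs(a0, a1, delayed_q1):
        return (-fail0 * a0 + mu * delayed_q1,
                fail0 * a0 - (fail1 + mu) * a1,
                fail1 * a1)

    for n in range(steps):
        on = n >= m                 # H(t - h) = 1 over the whole step
        d_now = q1[n - m] if on else 0.0
        d_next = q1[n + 1 - m] if on else 0.0
        k0 = rhs(q0[n], q1[n], d_now)
        k1 = rhs(q0[n] + dt * k0[0], q1[n] + dt * k0[1], d_next)
        q0.append(q0[n] + 0.5 * dt * (k0[0] + k1[0]))
        q1.append(q1[n] + 0.5 * dt * (k0[1] + k1[1]))
        q2.append(q2[n] + 0.5 * dt * (k0[2] + k1[2]))
    return q0, q1, q2


def area_above(q2, dt):
    """Trapezoidal estimate of the integral of 1 - q2(t)."""
    inner = sum(1.0 - v for v in q2[1:-1])
    return dt * (0.5 * (1.0 - q2[0]) + inner + 0.5 * (1.0 - q2[-1]))


n_data, lam, mu, h = 1, 0.01, 0.01, 300.0   # values from the Figure 11 caption
dt = 0.1
q0, q1, q2 = simulate_delay_raid(n_data, lam, mu, h, t_end=8000.0, dt=dt)
simulated = area_above(q2, dt)
predicted = (((2 * n_data + 1) * lam + mu) / ((n_data**2 + n_data) * lam**2)
             + mu * h / (n_data * lam))
print(simulated, predicted)
```

Note that $q_0 + q_1 + q_2$ is not conserved in this model: drives waiting out the delay are in none of the three states, which is exactly why a longer delay inflates the area above $q_2$.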

8.2 Delay with One Check Drive and Failure During Rebuild

To improve the model of the previous section, we explicitly
track the drive's progress while it is rebuilding, and allow
it to fail during rebuild. The cost of this improvement
is increased mathematical complexity, as the model is a
PDE. Under the notation of the previous section, we assume
that after the drive is replaced, it requires h units of
time to restore the data, but now the drive may fail during
recovery. The RAID may be in three functional states: no
drives failed, one drive failed and waiting to be replaced,
and one drive being reconstructed. In addition, there is
the fourth, implicit state: the failure state. The probability
distribution consists of three components:

1. $q_0(t)$ - the probability that a RAID system has no
failed drives;

2. $q_1(t)$ - the probability that a RAID system has one
failed drive waiting to be replaced;

3. $q_2(t,x)$ - the probability density of RAID systems
with one drive being reconstructed; thus, the probability
that a system is being reconstructed and the
reconstruction has lasted between $x$ and $x + \Delta x$ units
of time is $q_2(t,x)\,\Delta x$. Here, $0 \le x \le h$.

The system of differential equations (a generalization of 
Kolmogorov-Chapman equations) for this system is: 



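\begin{align}
\frac{dq_0}{dt} &= -(N+1)\lambda\,q_0(t) + q_2(t,h), \tag{14}\\
\frac{dq_1}{dt} &= (N+1)\lambda\,q_0(t) - (N\lambda+\mu)\,q_1(t), \tag{15}\\
\frac{\partial q_2(t,x)}{\partial t} &= -(N+1)\lambda\,q_2(t,x) - \frac{\partial q_2(t,x)}{\partial x}, \tag{16}\\
q_2(t,0) &= \mu\,q_1(t). \tag{17}
\end{align}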



This is a system of partial differential equations where the 
last equation is a boundary condition. Most of the terms 
are familiar, except for the following terms: 



1. $q_2(t,h)$ in equation (14), which is a contribution from
the drives for which reconstruction ended;

2. $-\partial q_2(t,x)/\partial x$ in equation (16), which is due to the RAID
reconstruction progress;

3. Equation (17) accounts for replacing failed drives,
for which reconstruction begins immediately.

Starting with all RAID systems with no failed drives and
no drives being reconstructed, we arrive at the following
initial conditions: $q_0(0) = 1$, $q_1(0) = 0$, $q_2(0,x) = 0$ for
$0 \le x \le h$. It is possible to obtain the MTTDL for this
system by the methods previously introduced. The system
is solved by considering the Laplace transform in the time
domain:



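\begin{align*}
z\hat{q}_0(z) - q_0(0) &= -(N+1)\lambda\,\hat{q}_0(z) + \hat{q}_2(z,h),\\
z\hat{q}_1(z) - q_1(0) &= (N+1)\lambda\,\hat{q}_0(z) - (N\lambda+\mu)\,\hat{q}_1(z),\\
z\hat{q}_2(z,x) - q_2(0,x) &= -(N+1)\lambda\,\hat{q}_2(z,x) - \frac{\partial \hat{q}_2(z,x)}{\partial x},\\
\hat{q}_2(z,0) &= \mu\,\hat{q}_1(z).
\end{align*}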



The strategy to solve the system is obvious: first we solve 
the third equation as an ODE: 



\[ \hat{q}_2(z,x) = e^{-((N+1)\lambda+z)x}\,\hat{q}_2(z,0) + \int_0^x e^{-((N+1)\lambda+z)u}\,q_2(0,x-u)\,du. \]
For given initial conditions, this reduces to
$\hat{q}_2(z,x) = \mu e^{-((N+1)\lambda+z)x}\,\hat{q}_1(z)$. Next, we substitute this result into
the first equation and solve the resulting algebraic linear
system. If $\hat{q}(z) = (\hat{q}_0(z), \hat{q}_1(z))$ then the solution
for given initial conditions may be represented as
$\hat{q}(z) = (zI - A(z))^{-1} e_0$, where $e_0 = (1,0)$ and

\[ A(z) = \begin{pmatrix} -(N+1)\lambda & \mu e^{-((N+1)\lambda+z)h} \\ (N+1)\lambda & -(N\lambda+\mu) \end{pmatrix}. \tag{18} \]
We have

\[ 1 - \mathrm{PDL}_t = P(\text{time of failure} > t) = q_0(t) + q_1(t) + \int_0^h q_2(t,x)\,dx. \]



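The Laplace transform is $\hat{q}_0(z) + \hat{q}_1(z) + \int_0^h \hat{q}_2(z,x)\,dx$, or
explicitly:

\[ \hat{q}_0(z) + \hat{q}_1(z) + \mu\,\frac{1 - e^{-((N+1)\lambda+z)h}}{(N+1)\lambda+z}\,\hat{q}_1(z). \]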



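The 0-th coefficient of the Taylor expansion with respect
to $z$ yields:

\begin{align*}
\mathrm{MTTDL} &= \frac{(2N+1)\lambda + (2 - e^{-(N+1)\lambda h})\,\mu}{(N^2+N)\lambda^2 + (1 - e^{-(N+1)\lambda h})(N+1)\lambda\mu} \\
&= \frac{(2N+1)\lambda+\mu}{(N^2+N)\lambda^2} - \frac{((\mu N+\mu)\lambda+\mu^2)\,h}{N^2\lambda^2} + \cdots
\end{align*}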



Again, for $h = 0$ we reproduce the result of Section 3.1.
One can check that MTTDL is a monotonically decreasing
function of $h$ for all $h > 0$. The limit as $h \to \infty$ is:

\[ \frac{(2N+1)\lambda + 2\mu}{(N+1)\lambda\,(N\lambda+\mu)}. \]

As in the previous examples, obtaining explicit solutions
for $\mathrm{PDL}_t$ presents a challenge, as it depends on the ability
to solve a transcendental equation akin to the Lambert
equation.
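
The closed-form MTTDL of this section is easy to evaluate and to cross-check. The sketch below is our own illustration with hypothetical function names and parameter values: it computes the closed form, recomputes the same quantity by solving the 2 x 2 system $(zI - A(z))\,\hat{q} = e_0$ at $z = 0$ with Cramer's rule and adding the rebuild contribution, and then examines the $h = 0$ value, the large-$h$ limit and the monotone decrease in $h$.

```python
import math


def mttdl_rebuild(n_data, lam, mu, h):
    """Closed-form MTTDL for one check drive with rebuild time h."""
    decay = math.exp(-(n_data + 1) * lam * h)   # chance a rebuild survives
    num = (2 * n_data + 1) * lam + (2.0 - decay) * mu
    den = ((n_data**2 + n_data) * lam**2
           + (1.0 - decay) * (n_data + 1) * lam * mu)
    return num / den


def mttdl_from_linear_system(n_data, lam, mu, h):
    """Same quantity from (zI - A(z)) q = e0 at z = 0, via Cramer's rule."""
    c = (n_data + 1) * lam
    decay = math.exp(-c * h)
    # -A(0) = [[c, -mu*decay], [-c, n_data*lam + mu]]
    a11, a12 = c, -mu * decay
    a21, a22 = -c, n_data * lam + mu
    det = a11 * a22 - a12 * a21
    q0 = a22 / det
    q1 = -a21 / det
    # Add the time spent rebuilding: the integral of q2(z, x) over [0, h].
    return q0 + q1 * (1.0 + mu * (1.0 - decay) / c)


n_data, lam, mu = 10, 1.0e-5, 1.0 / 6.0   # hypothetical values, per hour
for h in (0.0, 1.0, 10.0, 100.0, 1.0e3, 1.0e4):
    print(h, mttdl_rebuild(n_data, lam, mu, h),
          mttdl_from_linear_system(n_data, lam, mu, h))
```

With these numbers the two routes should agree to rounding error, which is a quick way to catch sign or index slips when adapting the matrix $A(z)$ to a different model.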

We should note that the delay system (11)-(13) studied
in the previous section may be formally obtained from the
system (14)-(17) by dropping the term $-(N+1)\lambda\,q_2(t,x)$ in
equation (16).

The technique introduced here is flexible enough to 
handle very sophisticated RAID models, for instance, 
M > 1 and the infamous "bathtub curve" (non-uniform 
failure rate of a drive, depending on its age). However, 
presenting these results is beyond the scope of this paper. 

9 Silent Data Corruption 

Occasionally, data is corrupted on the disk in a way that
is not detectable by hardware. This incorrect data is then
read and delivered to the user as though it were correct.
An advantage of Reed-Solomon codes is their ability to
detect and correct such errors. Reed-Solomon codes are
an example of maximum distance separable codes, about
which there are well known results in the area of coding
theory. See ifTSl , for example. 

RAID systems group bytes on the drives into words,
and words in the same position on each data drive are
combined to form the word in that position on the check
drive. A RAID system with M check drives and all drives
working can detect up to M (and sometimes more) corrupt
words in any position. Such RAID systems can correct
up to $\lfloor M/2 \rfloor$ corrupt words in any position.

Given that silent data corruption occurs relatively infrequently
(see [4] for a discussion of many types of errors),
it is sufficient to design RAID systems that are able to
detect and correct a single error at a time. This requires
N + 2 drives to be operational at any time. If we think of
$\mathrm{PDL}_t$ as a function of N and M for one of the models
previously discussed, then a system with N data drives and
M check drives has a probability of entering a state where
it cannot check for and correct silent data corruption of
$\mathrm{PDL}_t(N + 2, M - 2)$ under that model. Put differently, if
you have designed a RAID system that meets your reliability
needs without considering silent data corruption,
then the addition of two check drives allows you to check
for and correct silent data corruption with similar reliability.



10 Conclusions 

Given 100 drives, how do we design a RAID system to
give the best reliability with the highest data rate (percent
of drives used to store data)? We have studied the following
possible designs:

Double Mirroring: Each drive is mirrored twice. This yields a
data rate of 33/99.

Two Independent RAID 6: Divide the 100 drives into two
independent 50 drive RAID 6 systems (N = 48, M = 2). If
each RAID system has a PDL5 of q, then the $\mathrm{PDL}_t$ for the
two systems together is $\mathrm{PDL}_t = 1 - (1 - q)^2$.

Layered RAID 6: Divide 100 drives into 10 RAID 6 groups.
Let the 10 groups act together as a RAID 6. Let m be the
MTTDL for each ten drive group. Then we can model
$\mathrm{PDL}_t$ for the system by setting $\lambda = 1/m$. Note: for the
sector failure model, we only model sector failures at the
first level, not the second.

Large RAID: A single large RAID system with M = 4 or 11,
and N = 100 - M.

Clearly, the single large RAID system is superior to other
designs considered.
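
The bookkeeping behind these designs can be written down in a few lines. The sketch below is our own illustration: it tabulates the data rate of each design and evaluates $1 - (1 - q)^2$ in a form that stays accurate when q is tiny. The layered figure assumes 8 data drives in each 10-drive group and 8 data groups out of 10, which is our reading of the description rather than a number stated in the text; the function and dictionary names are hypothetical.

```python
import math


def combined_pdl(q, copies):
    """PDL of `copies` independent systems that each have PDL q."""
    # Equals 1 - (1 - q)**copies, evaluated stably for very small q.
    return -math.expm1(copies * math.log1p(-q))


# (data drives, total drives) for each design. The layered entry assumes
# 8 data drives per group and 8 data groups (an assumption, see above).
designs = {
    "double mirroring": (33, 99),
    "two independent RAID 6": (2 * 48, 100),
    "layered RAID 6": (8 * 8, 100),
    "large RAID, M = 4": (100 - 4, 100),
    "large RAID, M = 11": (100 - 11, 100),
}
data_rate = {name: data / total for name, (data, total) in designs.items()}

for name, rate in data_rate.items():
    print(f"{name}: data rate {rate:.2f}")
print(combined_pdl(1.0e-12, 2))
```

The naive expression `1 - (1 - q)**2` loses most of its significant digits once q approaches machine precision, which matters here because the PDL values of interest span many orders of magnitude.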
Table 3: Comparison of PDL5 for several RAID designs.